Skip to content

pp_pack.c: Convert from using 'to_uvchr' functions to 'to_uv' #23564

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: blead
Choose a base branch
from

Conversation

khwilliamson
Copy link
Contributor

This series of commits converts the calls that convert from UTF-8 to code points from using the utf8n_to_uvchr() forms to instead use the newer utf8_to_uv_foo() forms.

One reason to do this is that the newer forms have a much easier interface for returning to the caller if the input was malformed. The old interface was arcane, and this file did not use it correctly. Since at least 5.8.9, just one of the calls in this file knew that the string was malformed. Everywhere else, it thought things were hunky dory when they were not.

That meant that the code to cope with malformedness was dead; it never got executed. This p.r. rectifies that. However, it turns out that t/utf8_decode.t was relying on one of the code paths being dead. I had to remove the check when it came alive.

So I took care to make each commit in this series change only one call; to aid in bisecting if field experience indicates other calls should not check for validity.

  • This set of changes requires a perldelta entry

Change a few lines that affect the next commits that use a coding style
not conforming to current expectations

And also add some blank lines to serve as "paragraph" markers between code
segments, and change the indentation of a few lines in preparation for
blocks being added and removed in the next commits.
This converts a call to utf8n_to_uvchr() to instead use
utf8_to_uv_flags().

This is part of the process of replacing all the '_uvchr' functions with
'_uv' functions, which are now preferred.  See perlapi.

There is a subtle change of behavior here when the string to convert is
malformed.  I believe this is a bug fix.  The macro NEXT_UNI_VAL assumed
the failure return of utf8n_to_uvchr() is 'retlen' being -1.  However,
that value is only ever returned if the flags passed to the function
include UTF8_CHECK_ONLY.  Inspection of the code showed that this macro
is never called with that flag; so this never currently fails.

I looked at Perl v5.8.9, and the situation was the same.  I doubt it has
changed until now.

This commit changes to use utf8_to_uv_flags(), and its failure return is
properly checked for.  The upshot is that this could now fail; whereas
it couldn't before, even though the clear intent of the code has been to
fail upon error.
Like the previous commit, this converts a call to utf8n_to_uvchr() to
instead use utf8_to_uv_flags().

And like that commit, the previous code thought it was checking for
failure and croaking when it occurred, but was checking the wrong way so
that it never could fail.  And now it does a proper check and can fail.
The new form croaks under failure; the previous didn't.  But to get here
the previous paragraph of code had to succeed, which means this should
too.

The flag in the changed call is the default for utf8_to_uv_or_die(), so
it can be removed.
This converts a call to utf8n_to_uvchr(), using instead utf8_to_uv().
The flag parameter UTF8_ALLOW_DEFAULT before this commit is the default
for utf8_to_uv(), so it can be omitted.

Like previous commits, the previous code thought it was checking for
failure and croaking, but was checking the wrong way so that it never
could fail.  In the new commit, I had to remove that checking, because
it caused test suite failures.
Like previous commits, this converts a call to utf8n_to_uvchr() to use
instead utf8_to_uv_flags().

And like previous commits, the previous code thought it was checking for
failure and croaking, but was checking the wrong way so that it never
could fail.  With the new commit, it can fail.

Also, the previous commit stored the result in an 'int', whose range can
be significantly less than the code point returned.  The value was
truncated without notice.  Now, I croak if the value got truncated.

I wonder what I should do, as this is for unpacking a 'c' format, which
is supposed to be a signed 8-bit value.
Like previous commits, this converts a call to utf8n_to_uvchr() to
instead use utf8_to_uv_flags().

Unlike those commits, the previous code did in fact properly check for
failure.  But upon failure, it semi-unsafely assumed the start byte was
accurate as to the intended byte length of the input character.  (It did
check that the returned value did not go past the end of the buffer, by
using UTF8_SAFE_SKIP() ).  The new function call always returns the
correct number of bytes, so just use that instead of trying to recompute
it.

The mechanism for deteriming failure no longer depends on the CHECK_ONLY
flag, so that is removed.
Prior to this commit, the code went through the buffer with warnings
turned off, but saving the fact if there were problems.  If so, it ran
through the buffer again, with warnings turned on to display those
problems.

I'm unsure of why it didn't output any warnings until it got to the end.
But utf8_to_uv_msgs() allows us to store up the messages and output them
later, without having to reparse
The reason for it to be more than this was removed in the previous
commit
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant